feat: simulation search & tagging for large-scale campaigns#24
feat: simulation search & tagging for large-scale campaigns#24maxscheurer wants to merge 5 commits into
Conversation
* Add SLURM cluster autodiscovery module with unit tests * Fix partition state: report 'up' if any node is schedulable * Query real SLURM default account via sacctmgr show user * Parse default_time from sinfo %L field separately from max_time * Use tuple for NodeType.features to enforce full immutability * Add _run_command edge case tests (timeout, missing binary, nonzero exit) * Fix PLW2901 lint errors and reformat test file * feat: wire cluster autodiscovery into SlurmConfig and CLI (#12) * Add SlurmConfig.from_cluster() classmethod for autodiscovery * Use autodiscovery fallback when --account not provided in SLURM commands * Add tests for SlurmConfig.from_cluster() autodiscovery * Add tests for mdfactory config cluster CLI command * fix new NodeType signature in tests * Slurm settings taking precendence * Convert dataclasses to Pydantic models for consistency * Refactor subprocess execution into generalized run_command utility * add BaseSlurmConfig/SlurmConfig/normalize_slurm_time to performance package * migrate submit.py: replace @DataClass SlurmConfig with re-export shim from performance * split slurm ini keys to PARTITION_CPU/PARTITION_GPU/DEFAULT_QOS, add settings properties * add test_slurm_config: BaseSlurmConfig 3-tier precedence, SlurmConfig model, from_yaml * update test_submit: Pydantic SlurmConfig, min_cpus partition selection, slurm_partition_cpu * fix from_cluster() return type to Self, path: Path|str on from_yaml, tighten docstring * fix cli.py: partition=None sentinel, forward min_cpus/min_mem_gb to from_cluster() * add no_slurm_settings fixture; apply to isolation-sensitive from_cluster tests * extend no_slurm_settings to positive-path autodiscovery tests * pre-commit fixes * add "error" to is_broken node states * replace lru_cache with sentinel; don't cache transient sinfo failures * use mutable list cell for cluster cache; fix ruff PLW0603 * guard int() in 3-part GRES branch against malformed entries * render typeless GPU as 'unknown' in config cluster display * add tests: memory trailing-plus, node-count tiebreak, cluster-None+no-partition path * add tests: analysis_run/artifacts_run SLURM autodiscovery glue * add develop branch to CI pull_request trigger
* chore: open PR for issue 16 (PR 1 — Parsl build orchestration)
* feat: add Parsl-based parallel build orchestration
- New mdfactory/orchestration/ module (config, apps, build)
- ExecutorConfig/SlurmExecutorConfig with SLURM-native field names
- build_systems() dispatches parallel builds via @python_app
- build_systems_dry_run() previews without loading Parsl
- Rich live progress display with activity log
- Ctrl+C triggers parsl.clear() to cancel SLURM jobs
- Extended CLI: mdfactory build accepts CSV/YAML with --config/--dry-run
- Added parsl to optional-dependencies and pixi dev env
- 19 tests covering config, build dispatch, and error handling
* fix: explicit scancel on shutdown, full error messages in TUI
- _shutdown_parsl() extracts SLURM job IDs from DFK and runs scancel
- Ctrl+C now reliably cancels SLURM jobs instead of relying on parsl.clear()
- Error messages no longer truncated in activity log
- Increased activity log to 12 entries
* feat: show SLURM job status (running/pending) in build TUI
* style: add icon to SLURM status line for consistent indent
* fix: force SOL molecule name for water SMIRNOFF parametrization
When species.resname is not 'SOL' (e.g. 'WAT'), Interchange would
generate atom types with the wrong prefix (WAT_0 instead of SOL_0),
causing a KeyError in OpenMM's GromacsTopFile parser.
Force molecule.name='SOL' before Interchange export so atom types are
always consistent (SOL_0, SOL_1, SOL_2). Also invalidate stale
SOL_params.itp when the ITP is regenerated.
* fix: address review findings — lazy parsl imports, DFK lifecycle, tests
- Move parsl imports inside functions (optional-dependency pattern)
- Wrap build_systems() in try/finally for DFK cleanup on exceptions
- Split _shutdown_parsl() into independent blocks, log at WARNING
- Normalize single-YAML output directory to output/{hash}/
- Add pytest.importorskip guard to orchestration test files
- Add tests: _build_system_impl, SLURM cleanup, KeyboardInterrupt,
dict input path, CLI integration (9 tests), parametrize regression
- Add PLC0415 to ruff ignore (lazy imports intentional throughout)
* fix: address re-review findings — input validation, dry-run guard, tests
* refactor: collapse dry-run into build_systems(), remove build_systems_dry_run
* feat: generate summary YAML when building from CSV (matches prepare-build output)
* feat: integrate PR #11 SLURM infrastructure — from_cluster(), TUI wizard, --slurm flag
* refactor: eliminate duplication — shared resolve_slurm_fields(), CLI helpers, DFK guard
* fix: address review findings — lazy questionary import, run_dir, BMP symbols
- Lazy-import questionary (in [parsl] extra) so importing mdfactory.orchestration
no longer fails for users without the extra (matches parsl/rich pattern)
- Add run_dir field to ExecutorConfig (default ~/.parsl/mdfactory) wired into
parsl.Config so runinfo/ no longer scatters into the working directory;
Path serializer keeps YAML round-trips working
- Clarify cpus_per_node docstring (Parsl cores_per_node != --cpus-per-task)
- Replace print() with Console().print() in the SLURM TUI
- Restrict source symbols to UTF-8 BMP (replace non-BMP emoji in build progress)
- Update TUI tests to patch _import_questionary
* feat: prepare Parsl foundation — reusable session context manager + forward-looking config
- Extract generic Parsl lifecycle into mdfactory/orchestration/session.py:
parsl_session(config) context manager owns the DFK guard, load, and
shutdown (+ scancel) so simulate_systems() / benchmark sweeps reuse it
via 'with parsl_session(config): submit(); wait()'. ParslSession.detach()
covers the wait=False ownership-transfer case. build.py re-exports the
moved helpers for backward compatibility
- Parametrize _wait_with_progress(label=...) so the progress display is
reusable for simulation and benchmark workflows
- Add available_accelerators to ExecutorConfig for GPU worker pinning,
wired into both local and SLURM HighThroughputExecutors
- Add launch_options on SlurmExecutorConfig -> SrunLauncher(overrides=...)
for srun-level task placement / NUMA binding; default launcher untouched
when unset
- Prompt for the constraint field in the SLURM wizard
- Tests: new test_orchestration_session.py plus config/TUI coverage
* refactor: SlurmExecutorConfig inherits BaseSlurmConfig — eliminate field/from_cluster duplication
- SlurmExecutorConfig now inherits (ExecutorConfig, BaseSlurmConfig) instead
of redeclaring account/partition/qos/constraint and reimplementing
from_cluster() with an identical body
- Set model_config = ConfigDict(frozen=False) to resolve the frozen mismatch
(BaseSlurmConfig is frozen; executor configs are mutated by the TUI wizard)
- Single source of truth for SLURM scheduling fields across the submitit
(SlurmConfig) and Parsl (SlurmExecutorConfig) backends — advances issue #20
and prevents a divergent third copy as submitit is phased out
- Add regression tests asserting the inheritance + mutability so the dedup
cannot silently regress
* fix: enrich build failure metadata and guard result completeness
- Add _describe_failure() returning (failure_type, error_detail); failed
build results now carry failure_type/error_detail alongside error, so
future retry logic can distinguish a GROMACS crash from infrastructure
failures. Uses the actual re-raised exception (modern Parsl) and only
falls back to legacy .e_value defensively — not the inaccurate AppFailure
path from the original review note
- Add _collect_results() that raises RuntimeError naming uncaptured hashes
instead of silently returning a shorter list than was submitted
(defensive guard against a future polling-loop bug)
- Tests: unit-cover both helpers (plain + legacy-wrapped exception,
complete + uncaptured-slot) and assert the enriched failure dict
Implementation PlanAnalysisThe issue has three interconnected features: (1) tags on Critical technical finding: The CSV dot-notation in DeliverablesStage 1: Tags + Discovery Fixes
Stage 2: CLI Search Command
Stage 3: Textual TUI
Acceptance Criteria
Files to ModifyStage 1:
Stage 2:
Stage 3:
Design Principles
Testing Approach
Risks and Open Questions
Plan created by mach6 |
- Add tags field to BuildInput (excluded from hash for stability) - Add min_status param to SimulationStore.discover() - Add SimulationStore.search() with type/status/hash/tags/smiles filters - Add smiles_substructure_match() in chemistry_utilities - Add 'mdfactory search' CLI command with Rich table output - Add 'mdfactory browse' TUI with live-filtering DataTable - Add [tui] optional extra (textual>=0.50)
Progress UpdateImplementation of issue 23 — simulation search, tagging & TUI browser: Delivered
Not included
Commit: Progress tracked by mach6 |
Code ReviewCriticalNo critical findings. ImportantFinding 1 — Infinite poll loop when a Parsl future never completes (error-auditor, confidence: 85) Finding 2 — Schema-specific filters not implemented (completeness-checker, confidence: 95) Finding 3 — Finding 4 — Finding 5 — Chemistry extractor functions and SuggestionsFinding 6 — Finding 7 — CLI Finding 8 — Preprocessing subprocess has no timeout (error-auditor, confidence: 90) Finding 9 — Finding 10 — Documentation and examples for tagging workflow absent (completeness-checker, confidence: 95) Finding 11 — TUI Finding 12 — Finding 13 — Three-way duplicate file-output sink in CLI pull commands (simplifier, confidence: 95) Finding 14 — Two-way duplicate SLURM config resolution block (simplifier, confidence: 93) Finding 15 — TOCTOU race on Finding 16 — Finding 17 — Imports inside per-row loop in Finding 18 — Finding 19 — Additional simplification opportunities in CLI (simplifier, confidence: 85–92) Strengths
Agents run: code-reviewer, error-auditor, test-reviewer, completeness-checker, simplifier Reviewed by mach6 |
Review AssessmentClassifications
Action Plan
Deferred (track separately):
Assessment by mach6 |
- Add stall_timeout detection to Parsl poll loop (finding 1) - Add tests for search(smiles=...) integration (finding 3) - Add tests for search(status=...) threshold logic (finding 4) - Add tests for chemistry extractors (finding 5) - Fix @work race with exclusive=True in TUI (finding 6) - Add ValueError handling in CLI search command (finding 7) - Extract _write_output helper for CLI file output (finding 13) - Extract _resolve_slurm_config helper (finding 14) - Remove dead Select.BLANK guards in TUI (finding 16) - Hoist imports above loop for fail-fast in search() (finding 17) - Add test for remove_all_analyses type filter (finding 18)
Progress UpdateAddressed 11 review findings from the mach6-review: Bug fixes
Refactoring
Tests added (19 new tests)
Commit: Progress tracked by mach6 |
Closes #23
Add user-defined metadata tags on BuildInput, discovery/store improvements (min_status passthrough), CLI search command, and interactive Textual-based TUI for browsing and filtering large simulation campaigns.
Implementation plan posted as a comment below.